Abstract: The rapidly increasing number of cores available
in multicore processors does not necessarily lead directly to a 
commensurate increase in performance: programs written in 
conventional languages, such as C, need careful restructuring, 
preferably automatically, before the benefits can be observed in 
improved run-times. Even then, much depends upon the intrinsic 
capacity of the original program for concurrent execution. The 
subject of this paper is the performance gains from the combined 
effect of the complementary techniques of the Decoupled Software 
Pipeline (DSWP) and (backward) slicing. DSWP extracts thread- 
level parallelism from the body of a loop by breaking it into 
stages which are then executed pipeline style: in effect cutting 
across the control chain. Slicing, on the other hand, cuts the
program along the control chain, teasing out finer threads that
depend on different variables (or locations).
The main contribution of this paper is to
demonstrate that the application of DSWP, followed by slicing 
offers notable improvements over DSWP alone, especially when 
there is a loop-carried dependence that prevents the application 
of the simpler DOALL optimization. Experimental results show 
an improvement of a factor of ≈1.6 for DSWP + slicing over
DSWP alone and a factor of ≈2.4 for DSWP + slicing over the
original sequential code.
I. Introduction

Multicore systems have become a dominant feature in
computer architecture. Chips with 4, 8, and 16 cores are available
now and higher core counts are promised. Unfortunately,
increasing the number of cores does not offer a direct path
to better performance, especially for single-threaded legacy
applications. Using software techniques to parallelize the
sequential application can, however, raise the level of gain from
multicore systems.

Parallel programming is not an easy job for the user,
who has to deal with many issues such as dependencies,
synchronization, load balancing, and race conditions. For
this reason the role of automatically parallelizing compilers
and techniques for the extraction of several threads from
single-threaded programs, without programmer intervention,
is becoming more important and may help to deliver better
utilization of modern hardware.

Two traditional transformations, whose application typically
delivers substantial gains on scientific and numerical
codes, are DOALL and DOACROSS. DOALL assigns each
iteration of the loop to a thread (see Figure 1); the threads may
all execute in parallel, because there are no cross-dependencies
between the iterations. In the ideal case, DOALL performance scales
linearly with the number of available threads. The DOACROSS
technique is very similar to DOALL, in that each iteration is
assigned to a thread; however, there are cross-iteration data
and control dependencies. Thus, to ensure correct results,
data dependencies have to be respected, typically through
synchronization, so that a later iteration receives the correct
value from an earlier one, as illustrated in Figure 2.
DOALL and DOACROSS techniques depend on
identifying loops that have a regular pattern, but many
applications have irregular control flow and complex memory
access patterns, making their parallelization very challenging.
The Decoupled Software Pipeline (DSWP) has been shown to
be an effective technique for the parallelization of applications
with such characteristics. This transformation partitions the
loop body into a set of stages, ensuring that critical path
dependencies are kept local to a stage, as shown in Figure 3.

Each stage becomes a thread and data is passed between
threads using inter-core communication. The success of
DSWP depends on being able to extract the relatively fine- 
grain parallelism that is present in many applications. 
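
To make the distinction concrete, the following C sketch contrasts a loop whose iterations are independent, and so can be split into DOALL chunks, with a pointer-chasing loop whose traversal carries a dependence from one iteration to the next. The type and function names (Node, scale_chunk, sum_list) are illustrative assumptions and are not taken from the benchmarks. Threads are omitted: each call to scale_chunk stands in for the work one thread would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node, used only for this illustration. */
typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* DOALL candidate: iteration i touches only a[i] and b[i], so the range
 * [lo, hi) can be handed to any thread and run in any order. */
void scale_chunk(const int *a, int *b, size_t lo, size_t hi)
{
    assert(lo <= hi);
    for (size_t i = lo; i < hi; i++)
        b[i] = 2 * a[i];
}

/* Not a DOALL candidate: the address of the next node is known only after
 * the current node has been loaded (cur = cur->next), and total is
 * accumulated across iterations. Both are loop-carried dependences. */
int sum_list(const Node *head)
{
    int total = 0;
    for (const Node *cur = head; cur != NULL; cur = cur->next)
        total += cur->value;
    return total;
}
```

Calling scale_chunk on the two halves of an array in either order gives the same result as one sequential pass, which is the property DOALL relies on. No such reordering is possible for the traversal in sum_list, which is the situation DSWP is aimed at.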

Another technique which offers potential gains in parallelizing
general purpose applications is slicing. Program slicing
Fig. 1. DOALL technique, adopted from [2]




Fig. 2. DOACROSS technique, adopted from [2]

Fig. 3. DSWP technique, adopted from [2]


transforms large programs into several smaller ones that execute
independently, each consisting of only the statements relevant
to the computation of certain so-called (program) points. The
slicing technique is appropriate for parallel execution on a
multi-core processor because it can decompose
the application into independent slices that are executable in
parallel.

This work explores the possibility of performance benefits 
arising from a secondary transformation of DSWP stages by 
slicing. Our observation is that individual DSWP stages can
be parallelized by slicing, leading to an improvement in performance
of the longest-running DSWP stages. In particular,
this approach can be applicable in cases where DOALL is not.

The proposed method is implemented using the Low Level
Virtual Machine (LLVM) compiler framework. LLVM
combines a low level virtual instruction set
with high level type information. An important part
of the LLVM design is its intermediate representation (IR).
This has been carefully designed to allow many traditional
analyses and optimizations to be applied to LLVM code,
many of which are provided as part of the LLVM framework.

The remainder of the paper is organized as follows: the 


List *cur = head;
L: for (; cur != NULL; cur = cur->next)
X:     Work(cur);

{
S1: Slice1(cur);
S2: Slice2(cur);
}


Fig. 4. Sliced loop body with recurrence dependency 


2  double ss = 0;
3  int i;
4  double a[0] = 0;
5  while (node != NULL) {
6      Calc(node->data, a[i],
7           &a[i+1]);
8      i++;
9      node = node->next;
10 }


1  Calc(int M,
2       double da_in,
3       double *da_out) {
4      int j;
5      b[0] = 0;
6      for (j = 0; j < M; j++) {
7          m += da_in + seq(j);
8          (*da_out) +=
9              da_in + cos(m);
10         b[j] = b[j] + XX(m);
11     }
12 }


Fig. 5. Source program 


next section (Section II) describes how DSWP may be combined
with backward slicing, then Section III gives details of the
implementation. Section IV presents some experimental results
from the application of the automatic DSWP + slicing transformation.
Finally, in Section V we survey related work and
conclude (Section VI) with some ideas for future work.

II. DSWP + Slicing Transformation

The performance of a DSWP-transformed program is limited
by the slowest stage. Thus, any gains must come from
improving the performance of that stage. The main feature of
the proposed method is the application of backward slicing
to the longest stage emerging from the DSWP transformation.
This is particularly effective when that stage includes a function
call.
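
The bottleneck argument can be written down as a rough steady-state model. This is a simplification stated as an assumption, not a result from the experiments: pipeline fill and drain are ignored and communication is charged as a fixed per-iteration cost. The symbols $N$ (loop iterations), $T_L$ and $T_X$ (per-iteration time of the two DSWP stages), $T_{S_j}$ (per-iteration time of slice $j$ of stage X) and $c$ (communication cost) are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{seq}} &\approx N\,(T_L + T_X), \\
T_{\text{DSWP}} &\approx N\,\bigl(\max(T_L,\, T_X) + c\bigr), \\
T_{\text{DSWP+slicing}} &\approx N\,\bigl(\max(T_L,\, T_{S_1}, \dots, T_{S_k}) + c\bigr).
\end{align*}
```

Each slice contains a subset of the statements of X, so $T_{S_j} \le T_X$ in this model. Slicing therefore pays off when $T_X$ dominates $T_L$ and the slices are noticeably shorter than $T_X$. Statements duplicated in several slices reduce the gain, and a larger $c$ (for example, an extra buffer for a returned value) eats into it.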

To illustrate the method, consider the example in Figure 4.
DSWP partitions the loop body into the parts labelled L and X,
then we slice X to extract S1 and S2. Consequently, instead of
giving the whole of stage X to one thread, it can be distributed
across n threads, depending on the number of slices extracted,
with, in this case, one core running L (the first stage) and two
more running S1 and S2 (the slices from the second stage).
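
The decomposition of Figure 4 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code generated by the transformation: the node layout, the bodies of Work, Slice1 and Slice2, and the stage_* helper names are hypothetical, and the array of node pointers stands in for the communication queues. Threads are omitted, so each stage function represents the body of what would be a separate thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node: Work(cur) is assumed to update two independent
 * fields, which is what allows it to be cut into two slices. */
typedef struct List {
    int data;
    long square;   /* written only by Slice1 */
    long cube;     /* written only by Slice2 */
    struct List *next;
} List;

/* Original stage X: both computations in one body. */
void Work(List *cur)
{
    cur->square = (long)cur->data * cur->data;
    cur->cube = (long)cur->data * cur->data * cur->data;
}

/* Slice S1: only the statements that square depends on. */
void Slice1(List *cur) { cur->square = (long)cur->data * cur->data; }

/* Slice S2: only the statements that cube depends on. */
void Slice2(List *cur) { cur->cube = (long)cur->data * cur->data * cur->data; }

/* Stage L: the traversal. Each node is handed to the later stages through
 * out, which stands in for a communication queue of capacity cap. */
size_t stage_L(List *head, List **out, size_t cap)
{
    assert(out != NULL);
    size_t n = 0;
    for (List *cur = head; cur != NULL && n < cap; cur = cur->next)
        out[n++] = cur;
    return n;
}

/* One consumer per slice; each would run on its own core. */
void stage_S1(List **in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        Slice1(in[i]);
}

void stage_S2(List **in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        Slice2(in[i]);
}
```

Because the two slices write disjoint fields, running stage_S1 and stage_S2 in either order (or concurrently) leaves every node in the same state as the original loop that calls Work.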

However, while there are potential gains from splitting the 
loop body into several concurrent threads, there is still the cost 
of synchronization and communication between threads to take 
into account. To minimize these overheads we use lock-free
buffers. As a result, producer and consumer can access the
queue concurrently, via the enqueue and dequeue operations. 
This makes it possible for the producer and consumer to 
operate independently as long as there is at least one data 
element in the queue. 
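
A single-producer, single-consumer buffer of this kind can be sketched as below, in the style of the FastForward queue that Section III names for the synchronization step. This is a minimal sketch under assumptions, not the buffer code used in the experiments: the type and function names, the fixed capacity QUEUE_SLOTS and the use of C11 atomics are choices made for this illustration. The idea it shows is that an empty slot holds NULL, so the producer and consumer each keep a private index and communicate only through the slot contents.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 8  /* illustrative capacity */

/* A NULL slot means "empty". head is touched only by the producer and
 * tail only by the consumer, so no index is shared between cores. */
typedef struct {
    _Atomic(void *) slot[QUEUE_SLOTS];
    size_t head;   /* next slot to write */
    size_t tail;   /* next slot to read */
} SpscQueue;

void queue_init(SpscQueue *q)
{
    for (size_t i = 0; i < QUEUE_SLOTS; i++)
        atomic_init(&q->slot[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Producer side. Returns false, without blocking, when the queue is full. */
bool enqueue(SpscQueue *q, void *item)
{
    assert(item != NULL);  /* NULL is reserved as the empty marker */
    if (atomic_load_explicit(&q->slot[q->head], memory_order_acquire) != NULL)
        return false;
    atomic_store_explicit(&q->slot[q->head], item, memory_order_release);
    q->head = (q->head + 1) % QUEUE_SLOTS;
    return true;
}

/* Consumer side. Returns NULL, without blocking, when the queue is empty. */
void *dequeue(SpscQueue *q)
{
    void *item = atomic_load_explicit(&q->slot[q->tail], memory_order_acquire);
    if (item == NULL)
        return NULL;
    atomic_store_explicit(&q->slot[q->tail], NULL, memory_order_release);
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    return item;
}
```

A production version would typically also place head and tail on separate cache lines to avoid false sharing; that detail is left out here to keep the sketch short.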

III. Implementation of DSWP + Slicing

We build on earlier work by Zhao and Hahnenberg, who
implement DSWP in LLVM. We have extended that code with










assign the SCCs that represent the outer loop body to the 
first thread and the n extracted slices to n threads. 




Data dependence
Control dependence


Input: A PDG and a set of empty lists,
one for each node identifier (variable in the
slicing list).

Output: A slice for each node identifier (variable).


Fig. 6. Program Dependency Graph

Fig. 7. DAG of SCCs

Algorithm:

- Mark all PDG nodes as not visited


Slice_1(M, da_in, da_out) {
    int j;
    for (j = 0; j < M; j++) {
        m += da_in + seq(j);
        (*da_out) +=
            da_in + cos(m);
    }
}


Slice_2(M, da_in, da_out) {
    int j;
    b[0] = 0;
    for (j = 0; j < M; j++) {
        m += da_in + seq(j);
        b[j] = b[j] + XX(m);
    }
}


Fig. 8. Slice 1 on da_out

Fig. 9. Slice 2 on b[j]


backward slicing and a decision procedure to determine when
it is worth applying the transformation. The transformation
procedure is based on the algorithm for DSWP proposed by
Ottoni et al. It takes as input L, the loop to be optimized,
and modifies it as a side-effect. The details are as follows:


- ComputeASlice(exit node)

ComputeASlice(node n) {
    if node n is not visited
        Mark node n as visited
        Add the instructions of n to the set associated with node n
        For each node m (instruction) on which node n depends
            ComputeASlice(m)
            Add the content of the set associated with node m
            to the set associated with node n
}

Fig. 10. The ComputeAllSlice algorithm. Adopted from [T| 


1) Find candidate loop: This step looks for the most
profitable loop to which to apply DSWP + slicing. We collect static
information about the program and then use a heuristic
to estimate the number of cycles necessary to execute all
instructions in every loop in the program. The loop with
the largest estimated cycle count that contains a function
call is chosen. This is a first-approximation selection
procedure and clearly a more sophisticated version can
and should be substituted in due course.

2) Build the Program Dependency Graph (PDG): The
subject is the loop to be parallelized. In Figure 6,
solid lines (red) denote data dependencies and dashed
lines (black) control dependencies.

3) Build strongly connected component (SCC) DAG:
In order to keep all the instructions that contribute to
a dependency local to a thread, the strongly connected
components (SCCs) are built, followed by the DAG of the
SCCs. Consider the code in Figure 5. The loop (lines 5-10)
traverses a linked list and calls the procedure Calc.
Figure 7 shows the DAG of SCCs of the PDG of the program
on the left hand side of Figure 5. In the procedure Calc,
there are loop-carried dependencies that make DOALL
inapplicable. DOACROSS is only applicable with the
addition of synchronization that may cost more than is
gained. However, if we can extract independent short
slices from this stage and execute them in parallel, the
execution time for this long stage can be reduced. In
this case, after DSWP partitioning, we extract two slices
(Figures 8 and 9), where function seq is side-effect-free.

4) Assign SCCs to threads: The previous step may result in 
more SCCs than available threads. In this case, we merge 
SCCs until there are as many as there are threads. In our 
example, we have a function call in the loop body. We 


5) Extract slice: In this part, a small slicing program is
designed that can extract slices for the
limited range of the case studies. The algorithm illustrated
in Figure 10 is used to compute an intra-procedural
static slice. The n static slices from the function body are
extracted as follows.

In the first step, the PDG is built for the function body
by drawing up the dependency table, which holds both control
and data dependencies (similar to the one above used to
determine thread assignment). Secondly, the entry block
of the function body is examined to identify the
variables to be sliced, and their names are
collected on a slicing list. ComputeASlice
is then called to extract a slice for every listed variable. Next,
the control statement parts,
such as loops or if statements, are isolated in another table called the
control table. These control instructions
are added to an extracted slice if one of the
slice instructions is contained in the control part. For
each filtered variable in the slicing list, an empty list is first
associated with it and then all
the PDG table entries are scanned to find the one that
matches the slicing identifier. If one is found, all
the instructions on which it has a data or control dependency are
added to the associated list. This procedure is repeated for
all the instructions in the associated list and their operands,
and stops only when all the instructions and their
operands are contained in the list or all the variables that
represent the loop induction variables have been reached.
After a set of slices has been extracted from the function
body, they are filtered to remove redundant ones and so
avoid repeated calculation, which would happen if all
the instructions in one slice were also included in
another. For example, if there are two slices and slice 1
is completely contained in the longer slice 2, then we remove
the former and keep the latter. This procedure is repeated
for all n slices, after which the actual number of slices is known. In the case
of Figure 5, two slices are extracted, for the two variables
da_out and b[j].

6) Insert synchronization: To ensure correct results, the
dependences between threads must be respected, and for
pipeline parallelism to be effective, the overhead of
core-to-core communication must be as low as possible.
Hence, we use the FastForward circular lock-free queue
algorithm. In order to determine the source and the
destination of dependencies between the DSWP stages,
we need to inspect function arguments. These arguments
denote the data that will go in the communication buffers.
The destination of a dependency appears in the body of
a function, and this is where the data must be retrieved in
order for the sliced stages to work correctly.
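
The slice extraction and redundancy filtering of step 5 can be sketched in C as follows. This is an illustration under assumptions rather than the LLVM pass itself: the PDG is reduced to a bitmask per instruction (so at most 32 nodes), and the names Pdg, backward_slice and filter_redundant are hypothetical. The recursive walk follows the shape of ComputeASlice in Figure 10, and the filter drops any slice whose instructions are all contained in another slice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 32  /* illustrative limit: one bit per PDG node */

/* Hypothetical PDG: deps[n] has bit m set when instruction n depends
 * (by data or control) on instruction m. */
typedef struct {
    int n_nodes;
    uint32_t deps[MAX_NODES];
} Pdg;

/* Depth-first walk over dependence edges: mark n as visited, then pull
 * in everything n depends on. */
static void compute_a_slice(const Pdg *g, int n, uint32_t *visited)
{
    if (*visited & (UINT32_C(1) << n))
        return;
    *visited |= UINT32_C(1) << n;
    for (int m = 0; m < g->n_nodes; m++)
        if (g->deps[n] & (UINT32_C(1) << m))
            compute_a_slice(g, m, visited);
}

/* Backward slice for the instruction that defines the slicing variable. */
uint32_t backward_slice(const Pdg *g, int criterion)
{
    assert(criterion >= 0 && criterion < g->n_nodes);
    uint32_t slice = 0;
    compute_a_slice(g, criterion, &slice);
    return slice;
}

/* Copies to out every slice that is not contained in another slice and
 * returns how many were kept. Of identical slices, the first is kept. */
int filter_redundant(const uint32_t *in, int count, uint32_t *out)
{
    int kept = 0;
    for (int i = 0; i < count; i++) {
        bool redundant = false;
        for (int j = 0; j < count && !redundant; j++) {
            if (j == i)
                continue;
            bool subset = (in[i] & ~in[j]) == 0;
            bool same = in[i] == in[j];
            if (subset && (!same || j < i))
                redundant = true;
        }
        if (!redundant)
            out[kept++] = in[i];
    }
    return kept;
}
```

As a usage example, the body of Calc in Figure 5 can be modelled with five nodes: the loop control (0), the update of m (1), the update of da_out (2), the update of b[j] (3) and the initialisation b[0]=0 (4). Slicing on node 2 yields nodes {0, 1, 2} and slicing on node 3 yields {0, 1, 3, 4}, which correspond to the listings in Figures 8 and 9. A slice on m alone, {0, 1}, is contained in both and would be filtered out.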

IV. Experimental Results 

This section discusses the results obtained from the application
of the automatic implementation of the proposed method
that we presented in Section III. Several programs have been
used as case studies. Some are artificial and others are taken 
from 121. The discussion examines two issues: (i) the effect 
of lock-free buffers on the performance of DSWP, and (ii) the 
results from the application of DSWP + slicing, demonstrating
how this method can improve the performance of long stage
DSWP with different program patterns.

A. Communication Overhead 

This section examines the impact of communication costs 
on the performance of DSWP. It is important for us to be 
able to quantify this cost because it is a critical factor in 
the decision procedure for whether to carry out the DSWP
+ slicing transformation. We are also aware this cost will be
platform dependent, which is why we provide details of our 
particular platform. In a production deployment, this aspect 
would have to be measured as part of a calibration process. 

Consider the program in Figure 11. We wish to execute
it by applying DSWP to the loop that accounts for most of the
execution time of the program.

Initially, we partition the program into two parts, give each
to a thread and execute the threads as a pipeline. The first
thread handles the first inner loop (starting at line 6) and the second
thread the second inner loop (starting at line 16). Two
parameters play a vital role in determining the benefit (or
otherwise) of DSWP, namely M and N. M affects the amount of
work inside each thread by controlling the number of iterations
in the inner loops, while N, in effect, determines the volume
of data transfer between threads, by controlling the number of
outer loop iterations. Figure 12 shows how changing the value
of N (1-40) and M (1000-1000000) affects the execution time
of the DSWP version compared to the sequential program.
From N=6 and M=51000 onwards, the DSWP version
outperforms the sequential one.

Furthermore, the effect of the buffer size on the performance
of DSWP is examined, using the same program as in
Figure 11. This time, however, the value of N was

1  main()
2  int N, M;
3  ...
4  rows = N;
5  for (i1 = 1; i1 < rows; i1++) {
6    for (z = 1; z < M; z++) {
7      sum = 0;
8      for (a = 1; a < 10; a++)
9        sum = sum + image[i1]
10             * mask_1[a];
11     if (sum > max) sum = max;
12     if (sum < 0) sum = 10;
13     if (sum < out_image[i1])
14       out_image[i1] = sum;
15   }
16   for (z1 = 1; z1 < M; z1++) {
17     sum1 = 0;
18     for (a1 = 1; a1 < 10; a1++)
19       sum1 = sum1 + image[i1]
20              * mask_2[a1];
21     if (sum1 > max) sum1 = max;
22     if (sum1 < 0) sum1 = 10;
23     if (sum1 > out_image[i1])
24       out_image[i1] = sum1;
25   }
26 }


Fig. 11. Sequential version of program to evaluate DSWP overheads 
Fig. 12. Effect of N and M on DSWP


fixed to 1,000 and M to 10,000, and the only parameter that was
changed was the buffer size. This was varied between 10
and 1000, and the execution time of the program changed only
slightly (by 2 to 5 ms), a difference we attribute to the
time needed to create the linked list. We therefore
conclude that the effect of buffer size on DSWP performance is negligible.

B. Combining DSWP and slicing 

We now examine the effect of combining DSWP and
slicing by applying slicing to the long stage coming out of
the DSWP transformation. The sample programs that we study
here all exhibit an imbalance between the two stages of the
DSWP, i.e., the number of instructions in the outer loop is
less than the number of instructions in the function body.
The addition of slicing permits some degree of equilibration.
Two of the sample programs are artificial (linkedlist2.c and
linkedlist3.c), while the remaining three (fft.c, pro_2.4.c and
test0697.c) are genuine.

For each of the case studies, we extract two slices from
the function body, so that the maximum number of threads is
in general four, depending on whether the extracted slice












TABLE 1. Platform Details

Processor                 Intel(R) Core(TM) i7 CPU
Processor speed           2.93 GHz
Processor configuration   1 CPU, 4 cores, 2 threads per core
L1d cache size            32 KB
L1i cache size            32 KB
L2 cache size             256 KB
L3 cache size             8192 KB
RAM                       4 GB
Operating system          SUSE
Compiler                  GCC and LLVM


returns a value to the original loop or not. The data transferred
between DSWP stages corresponds to the arguments of a
function, of which there are between one and four in our case
studies.

LLVM-gcc (the LLVM C front end, derived from gcc) and
the LLVM compiler framework have been used to automate
our method. In addition, manually transformed programs have
been compiled using gcc in order to be able to compare manual
and automatic results. Table 1 summarises the technical details
of the evaluation platform.

Our automatic method uses two passes: 

1) The first pass carries out static analysis of all the loops in
a program. For each loop it adds up the static execution
time for each instruction in the loop body and also
accumulates the execution time for the function bodies
and stores these results in a table.

2) The second pass chooses a loop to transform and constructs
the software pipeline. This uses the data collected in the
previous pass to identify the highest cost loop that also
contains a function call.
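
The selection made at the start of the second pass can be sketched as follows. The record layout and the names LoopInfo and choose_loop are assumptions made for this illustration; in the implementation the table is filled by the analysis pass over LLVM IR, which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-loop record filled in by the first (analysis) pass. */
typedef struct {
    const char *name;
    unsigned long est_cycles;  /* static estimate: loop body plus callee bodies */
    bool has_call;             /* loop body contains a function call */
} LoopInfo;

/* Returns the most expensive loop that contains a function call, or NULL
 * when no loop qualifies. */
const LoopInfo *choose_loop(const LoopInfo *loops, size_t count)
{
    assert(loops != NULL || count == 0);
    const LoopInfo *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (!loops[i].has_call)
            continue;
        if (best == NULL || loops[i].est_cycles > best->est_cycles)
            best = &loops[i];
    }
    return best;
}
```

Note that a loop with a higher estimated cost but no function call is skipped, since the slicing step operates on the body of the called function.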

Next we look at the sample programs in more detail and 
at the results of the transformation process. 

fft.c An implementation of the fast Fourier transform.
The test program is a generalization of the program to
make it work with N functions. We give the outer loop to the
first thread and the fft function to the second thread. From the
results for fft.c (Figures 15 and 16) it is clear how the unbalanced long stage
affects DSWP performance, which improves only
slightly on the sequential program. We extract two slices from
the loop body: the first is the computation of the real part and
the second the imaginary part. Figure 15 shows the loop
speed up for DSWP + slicing in both manual and automatic
forms.

pro_2.4.c This program computes the derivatives of
N functions. F1 is the first derivative, F2 the second, D1 is the
error in F1, and D2 the error in F2. As with the previous
program, we extract two slices from the function body after giving
it to the second DSWP stage. Also as with the previous program,
we adapt the program and generalize it
to make it work for N functions. We set NMAX = 100000 and
vary M from M=5 to M=30. Figure 22 shows the execution
time for sequential, DSWP, DSWP + slicing (manual) and
DSWP + slicing (automatic). Figure 21 shows the loop speed
up for pro_2.4.c using DSWP + slicing.

test0697.c This program computes the spherical harmonics
function, which is used in many physical problems
ranging from the computation of atomic electron configuration
Fig. 13. Loop speed up with three threads for test0697.c program
Fig. 14. Execution times for program test0697.c
Fig. 15. Loop speed up with three threads for fft.c program
Llvm-seq   ...   dswp

5    0.702   ...   0.558
10   1.375   0.780   1.391   0.690   1.244
15   2.058   1.155   2.078   1.069   1.934
20   2.750   1.532   ...

Fig. 16. Execution times for program fft.c













































































to the representation of the gravitational and magnetic fields
of planetary bodies. It has two function calls inside the loop
body. The first, called spherical-harmonic-value, gives the
initial value to the argument of the second function, which
is called spherical-harmonic. The loop was divided
into two parts according to the estimated instruction latency.
The second function call (spherical-harmonic)
was allocated to the second thread, whilst the rest of
the loop body, containing the first function call, was assigned
to the first thread. Subsequently, two slices, c[] and s[], were
extracted from the second function call by applying the slicing
technique to this part alone. With high values (40000) of
L and M the execution time of this combination was better
than for the sequential program. The number of threads was
three, with two communication buffers, and the number of
transferred function arguments was four. The results obtained
by automatic and manual implementation for the sequential
and DSWP + slicing versions show that the former method
gives ≈1.4 speed up compared with the sequential program
in the LLVM environment (see columns 2 and 3 in the table in
Figure 14). Moreover, columns 4 and 5, under the GCC environment,
show that the speed up becomes ≈1.5 after applying the
slicing technique, while that for DSWP alone is only 1.3.

linkedlist{2,3}.c The fourth program is another
artificial program in two variants. The common feature is the
traversal of a linked list of linked lists (in contrast to the use of
arrays as in the other examples). The key difference between
the variants is that the function called from the loop body
does not return a value in the first (linkedlist2.c), and
does in the second (linkedlist3.c). This allows us to
demonstrate the cost of adding a buffer to the program. Two
parameters affect the workload, namely the length of the first
level list and the length of the second level list.


In these tests the length of the second level list is fixed at
1000 elements, while the length of the first ranges between 10
and 70, giving rise to the speed ups shown in Figure 17 and the
execution times shown in Figure 18. The results for the second
version of the program appear in Figures 19 and 20. By comparing
Figures 18 and 20 we can see how adding an additional
buffer to communicate the return value from one of the
slices affects the execution time. This cost appears to have a
marginally higher impact on the program using DSWP alone,
making it slower than the original sequential program.


V. Related work 

Weiser proposes the use of slicing for the parallel
execution of programs. He states that slicing is appropriate
for parallel execution on multiprocessor architectures because
of the ability to decompose the program into independent
slices that execute in parallel without synchronization or
shared memory, by duplicating the computation in each slice.
In general, it is claimed that the slices are shorter and execute faster
than the original program. However, there can be an arbitrary
difference in the speed of individual slice execution, leading to
an interleaving problem: how to find, at runtime,
the correct ordering for slice outputs. Consequently, after the
output of each slice is received, it needs to be reordered to
maintain the original program behaviour.

Wang et al. introduce a dynamic framework to parallelize
a single threaded binary program using speculative
Fig. 17. Loop speed up with three threads for linkedlist2.c program
Fig. 18. Execution times for linkedlist2.c program
Fig. 19. Loop speed up with three threads for linkedlist3.c program
Fig. 20. Execution times for linkedlist3.c program


slicing. The major contributions of this work can be summarized
as:


• Parallelization of binary code transparently for multicore 
systems. 

• Slicing of the ‘hot’ region of the program, rather than the 
whole program. In addition, they used a loop unrolling 
transformation that can help to find more loop-level 
parallelism in a backward slice even in the presence of 
loop-carried dependencies and they propose an algorithm 
Fig. 21. Loop speed up with three threads for Pro_2.4 program
Fig. 22. Execution times for Pro_2.4 program


to determine automatically the optimal unrolling factor. 
They also demonstrate how this factor can affect the 
parallelism. 

• Slicing-based parallelism for irreducible control flow
graphs. They define the backward slice using the program
dependency graph instead of a program regular expression.
They also introduce the Allow list, which uses post-dominator
relationships to solve the ambiguity problem
noted in the previous slicing solution, namely
the problem of determining the priority of the instructions
in each slice to get the right output, where the
slice output has to be reordered to maintain the original
program behaviour.

Rong et al. propose a method to construct a software
pipeline from an arbitrarily deep loop nest, whereas the
traditional approach is applied to the innermost loop or from the
innermost to outer loops. This approach is called the single-dimensional
software pipeline (SSP). The SSP name comes
from the conversion of a multi-dimensional data dependency
graph (DDG) to a 1-D DDG. The approach consists of three
steps.

• Loop Selection: Every loop level is inspected and the most
profitable one is selected for the software pipeline
schedule. Two criteria can be used to determine which
loop is most profitable for the software pipeline schedule:
initiation rate and data reuse.

• Dependency Simplification: The dependencies of
the selected loop Lx are simplified from the multi-dimensional data
dependency graph (DDG) to a single dimension which
contains zero dependencies.

• Final Schedule Computation: After obtaining the simplified
DDG, iteration points in the loop nest are allocated
to slices: for any i1 in [0,N1], iteration point (i1,0,...,0,0)
is assigned to the first slice, (i1,0,...,0,1) to the second,
and so on. All i1 iterations can be executed in parallel if
there is no dependency between the iterations and there are
unlimited resources. However, if there are dependencies,
these iterations will be executed using software pipelines.
To address resource limitations, the set of slices is
divided into groups, and work is relegated to succeeding groups
until some resources are available.

Rangan et al. introduced a new technique to utilize
a decoupled software pipeline for optimizing the performance
of recursive data structures (RDS) (e.g., linked lists, trees and
graphs). For this kind of structure, difficulties have been
encountered when trying to execute it in parallel, because the
instructions of a given iteration of a loop depend on the pointer
value that is loaded in a previous iteration. To
address this problem, a decoupled software pipeline has been
used so as to avoid the stalls caused by the long,
variable-latency instructions in RDS loops.

RDS loops consist of two parts, with the first containing
the traversal code (critical path of execution) and the second
representing the computation that should be carried out on each
node traversed by the first part. By determining which program
part is responsible for the traversal of the recursive data
structure, the backward slice for this part can be identified
and then decoupled software pipeline techniques can be used
to parallelize these parts. The first part will be given to one
thread and the second part to another. As the data dependency
between these parts is unidirectional (the computation chain
in the second part depends on the traversal chain in the first,
but not vice versa), the producer instruction is inserted in the
first part and the consumer one in the second.

Raman et al. introduce a parallel stage decoupled
software pipeline (PS-DSWP). This technique is positioned
between the decoupled software pipeline and DOALL. The
reason for this combination is that the slowest stage of DSWP
bounds the speed of DSWP, as we have noted, so this
work exploits the ability to execute some stages of DSWP
using DOALL. They use special hardware (a synchronization
array) to communicate data between cores. For this reason,
communication latency has very little impact on the performance
of PS-DSWP, but the special hardware is experimental and
not available on stock processors.

Huang et al. show that DSWP can improve performance
if it works with other techniques. This usage, called DSWP+,
divides the loop body into stages. These stages are open
to parallelization with other techniques such as DOALL,
LOCALWRITE and SpecDOALL. After constructing a program
dependency graph (PDG) of the loop and finding strongly
connected components (SCCs), the loop body is partitioned into
stages. These stages can be optimized by choosing a suitable
parallelizing technique for each stage. By giving a sufficient
number of threads to the parallelizable stages, DSWP+ can
produce balanced pipelines (there is no big gap in the execution
time of the work that is given to each stage). The results
suggest that DSWP+ (a combination method) gives more
speedup than using DSWP, DOALL, or LOCALWRITE alone.
It uses a lock-free queue and producer and consumer primitives
that are implemented in software to communicate data and
control conditions between threads. LOCALWRITE solves loop
carried dependencies for irregular computation over arrays



































based on array index determination at runtime; however, it does
not work in all cases.

VI. Conclusion 

This paper introduces the idea of DSWP applied in conjunction
with slicing, by splitting up loops into new loops
that are amenable to slicing techniques. An evaluation of this
technique on five programs with a range of dependence
patterns shows considerable performance gains on a Core i7
870 machine with 4 cores / 8 threads. The results are obtained
from an automatic implementation and show that the proposed
method can give a speed up of up to 2.4 compared with
the original sequential code.

The contribution of this paper is a proof of concept that
DSWP and slicing can offer useful benefits and, moreover, that
such a transformation can be done automatically and under the
control of a heuristic procedure that assesses the potential
gains to be achieved. Consequently, there is much work to
be done in respect of improving the collection of data and
the decision procedure, as well as the integration of the
technique into a non-experimental compiler environment. More
specifically, we aim to increase the potential parallelism that
can be extracted from the long stage of DSWP. One of the major
issues with backward slicing is that the longest critical path (slice)
creates a limit on parallelism. Insights from Wang et al. suggest
that we can increase parallelism (the number of extracted slices) by
combining loop unrolling with backward slicing in the presence
of loop-carried dependencies.